Make pp_stream wait on attn_backward_dx #10984
Open
PR types
Bug fixes
PR changes
Models
Description
Make pp_stream wait on attn_backward_dx. This fixes the slow loss decrease seen when overlap_p2p_comm is enabled.
The figure below shows the wait relationships before and after the fix.

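To be honest, I don't know why adding this wait works. I used bisection to narrow the problem down to PP(F), tried adding a wait there, and the loss went back to normal. My guess is that it is related to cross-stream memory allocation: unit tests showed me that Paddle's cross-stream memory allocation is unsafe in some cases. The model does not appear to contain any unsafe usage, but it is hard to be sure, so I am staying on the conservative side.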
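The description does not say how the bisection was carried out. As a rough illustration only, here is one plausible shape for it: force extra synchronization on half of the candidate ops, check whether the loss recovers, and halve again. The function name `find_culprit`, the predicate `loss_ok_with_syncs`, and every op name other than `PP(F)` are hypothetical. The sketch also assumes exactly one op is responsible.

```python
def find_culprit(candidates, loss_ok_with_syncs):
    """Bisect an ordered list of candidate ops down to a single culprit.

    Assumptions (not from the PR): exactly one op is responsible, and
    loss_ok_with_syncs(subset) returns True iff forcing synchronization
    on every op in `subset` makes the loss curve normal again.
    """
    lo, hi = 0, len(candidates)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if loss_ok_with_syncs(candidates[lo:mid]):
            hi = mid  # the culprit is in the first half
        else:
            lo = mid  # the culprit is in the second half
    return candidates[lo]


# Hypothetical usage: in practice the predicate would launch a short
# training run; here it is faked with a known answer.
ops = ["op_a", "op_b", "PP(F)", "op_d", "op_e"]
found = find_culprit(ops, lambda subset: "PP(F)" in subset)
```

Each call to the predicate costs one training run, so five candidates need at most three runs.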
This has some performance impact, because PP(F) is pushed later. This PR still needs improvement.
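The wait relationship and its cost can be modelled without a GPU. The sketch below is a toy simulation, not Paddle code: two Python threads stand in for the compute stream and pp_stream, and a `threading.Event` stands in for a CUDA event recorded after attn_backward_dx. The function name `run_schedule` and the sleep duration are assumptions made for illustration.

```python
import threading
import time


def run_schedule(pp_waits_for_dx):
    """Return the order in which the two simulated ops complete."""
    log = []
    lock = threading.Lock()
    dx_done = threading.Event()  # stand-in for an event recorded after attn_backward_dx

    def record(name):
        with lock:
            log.append(name)

    def compute_stream():
        time.sleep(0.05)  # pretend attn_backward_dx takes a while
        record("attn_backward_dx")
        dx_done.set()

    def pp_stream():
        if pp_waits_for_dx:
            dx_done.wait()  # the added dependency: PP(F) starts only after dx
        record("PP(F)")

    threads = [threading.Thread(target=compute_stream),
               threading.Thread(target=pp_stream)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


order_with_wait = run_schedule(pp_waits_for_dx=True)
```

With the wait enabled, PP(F) always completes after attn_backward_dx, which is also why it is delayed. Without the wait, the order depends on timing, and PP(F) is free to overlap with the backward computation.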
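Under normal conditions, with a single-node configuration (29 Decoder + 1 MTP) and 200 steps, the loss should drop to 7.3. Before this PR, with overlap_p2p_comm enabled, the loss only dropped to 8.7. Now it drops to 7.3 whether or not overlap_p2p_comm is enabled.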